Walmart, the retail giant that operates a chain of hypermarkets, wants to understand its weekly sales data, especially the impact of holidays and big events: specifically, the Super Bowl, Labor Day, Thanksgiving, and Christmas. In addition, Walmart wants to consider the effect of different macroeconomic and external factors on weekly sales.
At the end of this session, you will know how to
- use pandas to explore, clean, and transform the data
- use matplotlib and seaborn to extract insights
- use the sklearn library to split the data

Note: if you see code that's unfamiliar to you, look up the documentation and try to understand what it does.
Original sales data were collected from 45 stores across the United States; however, for this session, you will first inspect data from three stores and later focus on just store 1.
Each store is of a certain type and size, and each store contains multiple departments.
The dataset has a temporal component; we mostly ignore this in this session and will discuss time-series techniques later in the cohort.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # allow multiple outputs in a cell
import warnings
warnings.filterwarnings("ignore")
Built on top of numpy, pandas is one of the most widely used tools in machine learning. Its rich capabilities are used for exploring, cleaning, visualizing, and transforming data. We need to import the library to access all of its capabilities.
import pandas as pd
Use pd.read_csv to load train_comb.csv that contains weekly sales, metadata, and macroeconomic features from three stores into a pandas DataFrame.
!pwd
/Users/Gilles/Documents/GitHub/week-04-data-eng-airflow/nb
!ls ..
README.md imports.ipynb requirements.txt workflow-airflow dat nb walmart2.ipynb
filepath = '../dat/train_comb.csv'
data = pd.read_csv(filepath)
Verify that the data is loaded correctly by running data.head(3) to see the first few rows. (Avoid printing out the entire DataFrame with data or print(data); this may be harmless for a small dataset, but it can crash your kernel when the dataset is big and slow down the initial data exploration process.)
data.head(2)
| Store | Dept | Date | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Type | Size | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2010-02-05 | 24924.50 | False | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | A | 151315 |
| 1 | 1 | 1 | 2010-02-12 | 46039.49 | True | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.242170 | 8.106 | A | 151315 |
data.sample(2)
| Store | Dept | Date | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Type | Size | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3439 | 1 | 26 | 2010-10-22 | 6316.22 | False | 69.86 | 2.725 | NaN | NaN | NaN | NaN | NaN | 211.861294 | 7.838 | A | 151315 |
| 2897 | 1 | 22 | 2011-05-20 | 5595.39 | False | 67.63 | 3.907 | NaN | NaN | NaN | NaN | NaN | 215.733920 | 7.682 | A | 151315 |
data.shape
(30990, 16)
data.describe()
| Store | Dept | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Size | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 30990.000000 | 30990.000000 | 30990.000000 | 30990.000000 | 30990.000000 | 11062.000000 | 9260.000000 | 10779.000000 | 10990.000000 | 11062.000000 | 30990.000000 | 30990.000000 | 30990.000000 |
| mean | 6.050145 | 44.513746 | 26087.914360 | 61.278170 | 3.240187 | 9542.600315 | 4586.540622 | 2063.435241 | 4676.357248 | 6378.656460 | 157.540934 | 6.860820 | 192481.975605 |
| std | 5.113624 | 29.835120 | 32376.766734 | 17.113568 | 0.412234 | 8561.277370 | 12377.699214 | 12321.189247 | 7473.371273 | 4240.184513 | 41.179951 | 1.198274 | 29470.834745 |
| min | 1.000000 | 1.000000 | -898.000000 | 16.940000 | 2.514000 | 410.310000 | -10.500000 | 0.220000 | 8.000000 | 554.920000 | 126.064000 | 3.879000 | 151315.000000 |
| 25% | 1.000000 | 20.000000 | 4539.080000 | 47.960000 | 2.837000 | 4539.940000 | 67.000000 | 9.800000 | 983.650000 | 3719.380000 | 127.859129 | 5.965000 | 151315.000000 |
| 50% | 4.000000 | 38.000000 | 12941.920000 | 63.930000 | 3.294000 | 7146.900000 | 193.260000 | 50.600000 | 2421.080000 | 5563.920000 | 130.454621 | 7.193000 | 205863.000000 |
| 75% | 13.000000 | 72.000000 | 35645.547500 | 76.800000 | 3.610000 | 11075.380000 | 3579.210000 | 157.620000 | 5005.960000 | 7740.270000 | 211.522460 | 7.808000 | 219622.000000 |
| max | 13.000000 | 99.000000 | 385051.040000 | 91.650000 | 3.907000 | 53423.290000 | 89121.940000 | 109030.750000 | 57817.560000 | 31844.200000 | 223.444251 | 8.623000 | 219622.000000 |
❓ Question 1:
Look at the output of the previous cell to get an idea of each column and then write a few sentences describing what you notice about the data.
You can also use data.sample(3) to draw random samples from the data (hint: look into the number of rows and columns, document any missing values, note the data types, and find the date range of the collected data, etc.).
The table shows useful summaries such as the min, max, and count of values per column. The counts reveal missing values in the MarkDown columns, and the min reveals negative values for Weekly_Sales. Some of the numbers are not meaningful, e.g., the 25th percentile of Dept, since department numbers are labels rather than quantities.
I opened the CSV file in Excel and found the following: the data shows weekly sales for three stores and roughly 80 departments during 2010, 2011, and 2012. Observations:
- The MarkDown columns are often empty; for example, MarkDown1 is missing 19,928 rows (30,990 total rows minus a count of 11,062).
- Temperature and Fuel_Price are included, perhaps as proxies for heating/AC/refrigeration and transportation costs.
- CPI is the Consumer Price Index, a measure of consumer prices.
- Unemployment is the local unemployment rate.
- Size is the store's square footage: store 1 is 151,315; store 4 is 205,863; store 13 is 219,622.
- Type is always A for these three stores.
- IsHoliday flags whether the week contains a holiday: the weeks of Christmas, Thanksgiving, Labor Day (Sept 8-9), and mid-February (which matches the Super Bowl). Strangely, July 4th is not flagged.
About sales: the largest weekly figure is 385,051.04, for store 4, department 72, the week of 11/25/2011. All of the largest sales fall in the Thanksgiving and Christmas weeks, in departments 72 and 92 of stores 4 and 13. Store 1's best week (203,670, in 21st position overall) was also 11/25/2011 (maybe department 72 is the turkey department). Sales can also be negative; those values stay below 1,000 dollars in magnitude, perhaps returns or inventory adjustments.
Acceptable responses include the number of rows and columns in the dataset, the data types of the elements, how many NaNs there are (and perhaps which columns and/or rows tend to have them), the range of values in each column or other descriptive statistics, some commentary on what this data represents, any initial concerns about how you think we should model this data, or any other commentary you would like to add.
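These checks don't have to be done by eye. As a small sketch (on a toy stand-in frame, since the real `data` lives in the notebook), the same facts can be gathered programmatically:

```python
import numpy as np
import pandas as pd

# Small stand-in for the real `data` frame; the same calls apply unchanged.
data = pd.DataFrame({
    "Store": [1, 1, 4],
    "Date": ["2010-02-05", "2010-02-12", "2012-10-26"],
    "Weekly_Sales": [24924.50, 46039.49, -898.00],
    "MarkDown1": [np.nan, np.nan, 410.31],
})

n_rows, n_cols = data.shape              # sample size and feature count
missing_per_col = data.isna().sum()      # NaNs per column
date_lo, date_hi = data["Date"].min(), data["Date"].max()  # ISO strings sort chronologically

print(n_rows, n_cols)
print(missing_per_col["MarkDown1"])
print(date_lo, date_hi)
```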
Use .shape to inspect the size of the data: sample size and number of features.
# YOUR CODE HERE
data.dtypes
Store int64 Dept int64 Date object Weekly_Sales float64 IsHoliday bool Temperature float64 Fuel_Price float64 MarkDown1 float64 MarkDown2 float64 MarkDown3 float64 MarkDown4 float64 MarkDown5 float64 CPI float64 Unemployment float64 Type object Size int64 dtype: object
Extract the data for store 1 into a separate DataFrame.
mask = (data['Store']==1)
data_store1 = data[mask]
Let's take a random department, for example department 9, and retrieve the data.
data_store1.shape
(10244, 16)
mask9 = (data_store1['Dept']==9)
data_store1_dept9 = data_store1[mask9]
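As an aside, the two filters can be combined in one step or written with `.query()`. A minimal sketch on a toy frame whose columns mirror the real data:

```python
import pandas as pd

data = pd.DataFrame({
    "Store": [1, 1, 4],
    "Dept":  [9, 1, 9],
    "Weekly_Sales": [16930.99, 24924.50, 100.00],
})

# Chained boolean masks: note the parentheses around each comparison.
by_mask = data[(data["Store"] == 1) & (data["Dept"] == 9)]

# .query() expresses the same filter as a string, often easier to read.
by_query = data.query("Store == 1 and Dept == 9")

print(by_mask.equals(by_query))  # True
```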
Verify the result using .head(), .shape.
data_store1_dept9.head()
| Store | Dept | Date | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Type | Size | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1144 | 1 | 9 | 2010-02-05 | 16930.99 | False | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | A | 151315 |
| 1145 | 1 | 9 | 2010-02-12 | 16562.49 | True | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.242170 | 8.106 | A | 151315 |
| 1146 | 1 | 9 | 2010-02-19 | 15880.85 | False | 39.93 | 2.514 | NaN | NaN | NaN | NaN | NaN | 211.289143 | 8.106 | A | 151315 |
| 1147 | 1 | 9 | 2010-02-26 | 15175.52 | False | 46.63 | 2.561 | NaN | NaN | NaN | NaN | NaN | 211.319643 | 8.106 | A | 151315 |
| 1148 | 1 | 9 | 2010-03-05 | 24064.70 | False | 46.50 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.350143 | 8.106 | A | 151315 |
data_store1_dept9.shape
(143, 16)
Let's visualize one full year of weekly sales. First, sort the data by date. Then, plot the first 52 weeks of the weekly sales over time.
# First, sort values by date
data_store1_dept9 = data_store1_dept9.sort_values('Date')
# Then plot the weekly sales for the first 52 weeks, by setting the index to Date
data_store1_dept9[['Date', 'Weekly_Sales']].iloc[:52]\
.set_index('Date').plot(rot=90);
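One subtlety: sorting `Date` as a plain string works here only because the dates are ISO-formatted (`YYYY-MM-DD`), where lexicographic and chronological order coincide. A small cautionary sketch with US-style date strings, where they differ:

```python
import pandas as pd

us_dates = pd.Series(["02/05/2010", "12/31/2009", "03/05/2010"])

# Lexicographic sort compares strings character by character,
# so the 2009 date wrongly ends up last.
string_order = us_dates.sort_values().tolist()

# Converting to datetime first gives true chronological order.
chrono_order = (pd.to_datetime(us_dates, format="%m/%d/%Y")
                  .sort_values()
                  .dt.strftime("%m/%d/%Y")
                  .tolist())

print(string_order)  # ['02/05/2010', '03/05/2010', '12/31/2009']
print(chrono_order)  # ['12/31/2009', '02/05/2010', '03/05/2010']
```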
❓ Question 2:
Do you have any hypotheses about the holidays' impact on the sales?
It looks like people buy before holidays rather than during them. However, the data's yes/no IsHoliday field does not fully line up with this intuition (for example, the week flagged around Christmas behaves differently).
For the purpose of this notebook, we focus on the sales data from Store 1, which is saved in train-store1.csv.
Let's read in the data.
df = pd.read_csv("../dat/train-store1.csv")
Let's work on extracting week, month, and year information from the Date column to be able to better manipulate the weekly data. Pandas comes with powerful capabilities to make this step easy. Reference: tutorial.
Before extracting the data, use .dtypes to check the datatype of the Date column.
What's the difference between df[['Date']] and df['Date']?
df.head()
| Store | Dept | Date | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Type | Size | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2010-02-05 | 24924.50 | False | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | A | 151315 |
| 1 | 1 | 1 | 2010-02-12 | 46039.49 | True | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.242170 | 8.106 | A | 151315 |
| 2 | 1 | 1 | 2010-02-19 | 41595.55 | False | 39.93 | 2.514 | NaN | NaN | NaN | NaN | NaN | 211.289143 | 8.106 | A | 151315 |
| 3 | 1 | 1 | 2010-02-26 | 19403.54 | False | 46.63 | 2.561 | NaN | NaN | NaN | NaN | NaN | 211.319643 | 8.106 | A | 151315 |
| 4 | 1 | 1 | 2010-03-05 | 21827.90 | False | 46.50 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.350143 | 8.106 | A | 151315 |
df['Date'] # Series versus DataFrame
0 2010-02-05
1 2010-02-12
2 2010-02-19
3 2010-02-26
4 2010-03-05
...
10239 2012-08-31
10240 2012-09-07
10241 2012-09-14
10242 2012-10-05
10243 2012-10-12
Name: Date, Length: 10244, dtype: object
df[['Date']]
| Date | |
|---|---|
| 0 | 2010-02-05 |
| 1 | 2010-02-12 |
| 2 | 2010-02-19 |
| 3 | 2010-02-26 |
| 4 | 2010-03-05 |
| ... | ... |
| 10239 | 2012-08-31 |
| 10240 | 2012-09-07 |
| 10241 | 2012-09-14 |
| 10242 | 2012-10-05 |
| 10243 | 2012-10-12 |
10244 rows × 1 columns
Then, convert the Date column to a datetime object.
df.Date=pd.to_datetime(df.Date)
Verify that the Date column's datatype has changed as expected.
df[['Date']].dtypes
Date datetime64[ns] dtype: object
# Hint: use the dt accessor
df['week'] = df.Date.dt.week
df['month'] = df.Date.dt.month
df['year'] = df.Date.dt.year
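A caveat on the cell above: `Series.dt.week` was deprecated and later removed in pandas 2.0. If you are on a newer pandas, `dt.isocalendar().week` is the replacement (it returns `UInt32`, so cast if downstream code expects `int64`):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2010-02-05", "2010-12-31"]))

# isocalendar() returns a frame with year/week/day columns; take week and cast.
weeks = dates.dt.isocalendar().week.astype("int64")

print(weeks.tolist())  # [5, 52]
```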
Verify that now there are 19 columns in your DataFrame.
df.shape
(10244, 19)
❓ Question 3:
Last step before we look deeper into the features is to split the data set into training and testing datasets.
Discuss: why do we want to perform EDA only on the training data, not the entire dataset? Isn't more data better?
We are going to build our analysis from the training data, so we should know what it looks like. Restricting EDA to the training set also keeps information from the test set from leaking into our modeling decisions, so the test set remains an honest estimate of performance. It will allow us to identify which features are the most interesting and could be transformed or engineered.
Split the data into training dataset (80%) and test dataset (20%). Use function train_test_split from scikit-learn ( a popular library for machine learning in Python ), and set random_state to be 42 for reproducibility ( this is not the best way to do train-test-split due to the temporal nature of the data, however, we will ignore it for now).
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
# random_state fixes the shuffling seed so the split is reproducible;
# the specific value (42 here) is arbitrary.
print('Original set ---> ',df.shape,
'\nTraining set ---> ',df_train.shape,
'\nTesting set ---> ', df_test.shape)
Original set ---> (10244, 19) Training set ---> (8195, 19) Testing set ---> (2049, 19)
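Since the data is temporal, a chronological split (train on earlier weeks, test on later ones) is usually preferred for forecasting problems. A minimal sketch of the idea on a toy frame (the `_t` suffix is just to avoid clashing with the notebook's `df_train`):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2011-06-10",
                            "2012-08-31", "2012-10-26"]),
    "Weekly_Sales": [24924.50, 46039.49, 137181.13, 7000.0, 8000.0],
})

# Hold out the most recent 20% of rows instead of a random sample.
df = df.sort_values("Date")
cutoff = int(len(df) * 0.8)
df_train_t, df_test_t = df.iloc[:cutoff], df.iloc[cutoff:]

# Every training date precedes every test date: no leakage from the future.
assert df_train_t["Date"].max() <= df_test_t["Date"].min()
```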
We inspected the datatype of column Date. Now, let's find datatypes for all columns in df_train.
df_train.dtypes
Store int64 Dept int64 Date datetime64[ns] Weekly_Sales float64 IsHoliday bool Temperature float64 Fuel_Price float64 MarkDown1 float64 MarkDown2 float64 MarkDown3 float64 MarkDown4 float64 MarkDown5 float64 CPI float64 Unemployment float64 Type object Size int64 week int64 month int64 year int64 dtype: object
Summary statistics provide a general understanding of the data. Use the method .describe(). By default it reports the mean, max, min, and quantiles for numerical features, and the count, number of unique values, and mode for categorical features.
pd.options.display.float_format = "{:,.2f}".format
df_train.describe()
| Store | Dept | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Size | week | month | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8,195.00 | 8,195.00 | 8,195.00 | 8,195.00 | 8,195.00 | 2,909.00 | 2,414.00 | 2,855.00 | 2,909.00 | 2,909.00 | 8,195.00 | 8,195.00 | 8,195.00 | 8,195.00 | 8,195.00 | 8,195.00 |
| mean | 1.00 | 44.50 | 22,120.57 | 68.19 | 3.22 | 8,031.35 | 2,941.32 | 1,303.00 | 3,729.31 | 5,048.08 | 215.98 | 7.61 | 151,315.00 | 25.86 | 6.46 | 2,010.96 |
| std | 0.00 | 29.96 | 28,182.60 | 14.26 | 0.42 | 6,455.09 | 7,871.15 | 8,058.11 | 5,932.50 | 3,271.40 | 4.32 | 0.38 | 0.00 | 14.21 | 3.26 | 0.79 |
| min | 1.00 | 1.00 | -863.00 | 35.40 | 2.51 | 410.31 | 0.50 | 0.25 | 8.00 | 554.92 | 210.34 | 6.57 | 151,315.00 | 1.00 | 1.00 | 2,010.00 |
| 25% | 1.00 | 20.00 | 3,483.89 | 57.79 | 2.76 | 4,039.39 | 40.48 | 6.00 | 577.14 | 3,127.88 | 211.53 | 7.35 | 151,315.00 | 14.00 | 4.00 | 2,010.00 |
| 50% | 1.00 | 38.00 | 10,377.62 | 69.31 | 3.29 | 6,154.14 | 137.86 | 30.23 | 1,822.55 | 4,325.19 | 215.46 | 7.79 | 151,315.00 | 26.00 | 6.00 | 2,011.00 |
| 75% | 1.00 | 72.00 | 32,128.43 | 80.49 | 3.59 | 10,121.97 | 1,569.00 | 101.64 | 3,750.59 | 6,222.25 | 220.43 | 7.84 | 151,315.00 | 38.00 | 9.00 | 2,012.00 |
| max | 1.00 | 99.00 | 203,670.47 | 91.65 | 3.91 | 34,577.06 | 46,011.38 | 55,805.51 | 32,403.87 | 20,475.32 | 223.44 | 8.11 | 151,315.00 | 52.00 | 12.00 | 2,012.00 |
❓ Question 4:
Inspect the output, what are some of your observations?
The introduction of the year/month/week show there are data for three years, 52 weeks, 12 months std standard deviation gives an idea of the spread of the data. all data are for store 1 so means std of the store number = 0 unemployment data are clustered markdown data are spread
Are there any missing values? Use .isna() and .sum() to show the number of missing values from each column.
df_train.isna().sum()
Store 0 Dept 0 Date 0 Weekly_Sales 0 IsHoliday 0 Temperature 0 Fuel_Price 0 MarkDown1 5286 MarkDown2 5781 MarkDown3 5340 MarkDown4 5286 MarkDown5 5286 CPI 0 Unemployment 0 Type 0 Size 0 week 0 month 0 year 0 dtype: int64
What do you think the target variable is in this problem? Assign the column name to target for later use.
target = 'Weekly_Sales'  # the business objectives are all about sales
Visualize the distribution of the target variable using distplot() from the seaborn library ( Why seaborn? Check out a comparison between Matplotlib and Seaborn here ). Note that distplot() is deprecated in recent seaborn releases; histplot() or displot() are the modern replacements.
What do you observe visually that the output from .describe() was not making obvious? What is the distribution of the target variable?
import seaborn as sns
# Plot the target's distribution with the bins set to 10
sns.distplot(df_train[target], bins=10)
<AxesSubplot: xlabel='Weekly_Sales', ylabel='Density'>
Notice that there exists nonpositive weekly sales. How many rows are there that the weekly sales are negative or 0?
# Count the rows with weekly sales less than or equal to 0 in the training data
(df_train[target] <= 0).sum()
# Expected Output: 13
13
What percentage is the negative and zero sales?
# There are 13 zero-or-negative sales out of 8195 training records:
# 13 / 8195 * 100 = 0.16%
print("{:.2f}".format(13/8195*100))  # .2f keeps 2 digits after the decimal point
# HINT if you think about target variable as a binary variable,
# with either negative and zero sales or positive sales,
# then the percentage of negative and zero sales is the percentage of 0s in the target variable
# which is the same as taking the average of the negative and zero sales
0.16
overall = df_train.shape[0]
neg_or0 = (df_train[target]<=0).sum()
percent = neg_or0/overall*100
print("{:.2f}".format(percent))
0.16
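The hint about treating the target as a binary indicator can be written in one line: a boolean mask holds 0s and 1s, so its mean is the fraction of non-positive sales. A quick check on toy numbers:

```python
import pandas as pd

sales = pd.Series([100.0, -5.0, 0.0, 250.0, 80.0])

# The boolean mask averages to the fraction of True values.
pct_nonpositive = (sales <= 0).mean() * 100

print(f"{pct_nonpositive:.2f}")  # 40.00
```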
After communicating your findings, the stakeholders confirm that you can remove these data entries for now and they are launching an investigation with data analysts and data engineers.
Now remove the negative and zero sales from the training dataset.
mask = df_train[target] > 0
df_train = df_train[mask]
df_train.shape # Expected Output: (8182, 19)
(8182, 19)
Let's move on to feature engineering.
Although almost all the features are numerical, should they all be treated as numerical features? Let's inspect the number of unique values.
What does this tell you?
[(col, df_train[col].nunique()) for col in df_train.columns]
[('Store', 1),
('Dept', 77),
('Date', 143),
('Weekly_Sales', 10042),
('IsHoliday', 2),
('Temperature', 143),
('Fuel_Price', 137),
('MarkDown1', 51),
('MarkDown2', 41),
('MarkDown3', 49),
('MarkDown4', 51),
('MarkDown5', 51),
('CPI', 143),
('Unemployment', 12),
('Type', 1),
('Size', 1),
('week', 52),
('month', 12),
('year', 3)]
Temperature, CPI, Unemployment, and Fuel_Price are continuous variables. These tie into the second business objective of examining the effects of different external factors on weekly sales.
Put these four features into a list and store it in external_factors.
Earlier we noticed that MarkDownx columns contain some missing values, so we will focus on them in a later task.
external_factors = ['Temperature','CPI','Unemployment', 'Fuel_Price']
Plot Temperature using a box plot.
Hint: Set the figure size as (6, 4), and turn off the grid.
❓ Question 5:
Visualize Temperature in a box plot, what do you think the advantage of a box plot over histogram?
A box plot summarizes the median, quartiles, and outliers in one compact figure and does not depend on a bin-width choice, which makes it easier than a histogram to compare spread and spot extreme values.
df_train.boxplot(column='Temperature', figsize=(6, 4), grid=False);
Plot all four numerical features using both distribution plot and box plot. Note any observations.
import matplotlib.pyplot as plt
print('\033[1mNumeric Features Distribution'.center(100))
figsize = (12, 4)
n=len(external_factors)
colors = ['g', 'b', 'r', 'y', 'k']
# histogram
plt.figure(figsize=figsize)
for i in range(len(external_factors)):
    plt.subplot(1, n, i + 1)
    sns.distplot(df_train[external_factors[i]],
                 bins=10,
                 color=colors[i])
plt.tight_layout();
# boxplot
plt.figure(figsize=figsize)
for i in range(len(external_factors)):
    plt.subplot(1, n, i + 1)  # 1 row of n box plots, matching the histograms above
    df_train.boxplot(external_factors[i])
plt.tight_layout();
Numeric Features Distribution
We will investigate the impacts of the external factors later. Now let's scan through the other features.
Store, Type, and Size each have only one unique value, offering no information, so we can safely ignore them.
We extracted year, month, and week from Date, thus Date is redundant; but it is easy to find the date range in the training dataset using Date:
df_train['Date'].min(), df_train['Date'].max()
# Expected Output: (Timestamp('2010-02-05 00:00:00'), Timestamp('2012-10-26 00:00:00'))
(Timestamp('2010-02-05 00:00:00'), Timestamp('2012-10-26 00:00:00'))
Our training data ranges from 5th of February 2010 to 26th of October 2012.
It makes more sense to treat year, month, week as categorical variables, and even more accurately ordinal variables.
The boolean feature IsHoliday can be considered categorical, and so can Dept.
Let's put these column names into a list categoricalFeatures.
categoricalFeatures = ['year','month','week','IsHoliday', 'Dept']
For the categorical features, we are interested in the frequency of each value.
Use pandas method value_counts to find the number of rows where IsHoliday is true and false respectively.
print(df_train['IsHoliday'].value_counts())
False 7605 True 577 Name: IsHoliday, dtype: int64
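`value_counts(normalize=True)` reports proportions instead of raw counts, which makes the class imbalance in `IsHoliday` easier to read off. A small sketch on a toy column:

```python
import pandas as pd

is_holiday = pd.Series([False] * 9 + [True])

counts = is_holiday.value_counts()                # raw frequencies
shares = is_holiday.value_counts(normalize=True)  # proportions, summing to 1

print(counts.loc[False], counts.loc[True])  # 9 1
print(shares.loc[True])                     # 0.1
```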
Visualize the distribution of month; use sns.countplot().
df_train['month']
3590 12
9297 6
22 7
9989 5
576 3
..
4859 8
919 4
500 6
4517 4
5925 7
Name: month, Length: 8182, dtype: int64
sns.countplot(x=df_train["month"]);
Next, let's look into some plots of categorical features.
#Visualizing the categorical features
print('\033[1mVisualising Categorical Features:'.center(100))
plt.figure(figsize=(12,12))
for i in range(len(categoricalFeatures)):
    plt.subplot(6, 1, i + 1)
    sns.countplot(x=df_train[categoricalFeatures[i]])
plt.tight_layout();
Visualising Categorical Features:
❓ Question 6:
Discuss with your pair programming partner:
There is less data in 2012 than in the previous two years. Did sales drop from previous years? Does it affect what we see in the plots for month and week? Does the plot below clarify it to some degree?
REPLY
The plot below shows that there are no sales figures after week 43 in 2012 (the training data ends on 2012-10-26). From January 2012 to October 2012, sales looked stronger than in the same weeks of 2010/2011.
plt.figure(figsize=(12, 6))
sns.lineplot(data=df_train, x="week", y="Weekly_Sales", style='year');
The first business objective is to understand the impact of holidays on weekly sales.
There is a feature IsHoliday that we can use to calculate the average weekly sales for holiday weeks and non-holiday weeks.
For this purpose, we can use .groupby() and .mean(). Are holiday sales higher?
df_train.groupby('IsHoliday').dtypes
| Store | Dept | Date | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Type | Size | week | month | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IsHoliday | ||||||||||||||||||
| False | int64 | int64 | datetime64[ns] | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | object | int64 | int64 | int64 | int64 |
| True | int64 | int64 | datetime64[ns] | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | object | int64 | int64 | int64 | int64 |
df_train.groupby('IsHoliday')['Weekly_Sales'].head(2)
3590 3,190.20 9297 137,181.13 4016 2,844.88 8084 16,720.84 Name: Weekly_Sales, dtype: float64
df_train.groupby('IsHoliday')['Weekly_Sales'].mean()
IsHoliday False 22,041.72 True 23,664.80 Name: Weekly_Sales, dtype: float64
But we would like to understand it at a more granular level. Remember Simpson's paradox? To save some time, the holiday date mappings have been identified for the training data.
We provided the code to create a flag for each holiday to help you analyze weekly sale by each holiday type.
superbowl_mask = df_train['Date'].isin(['2010-02-12', '2011-02-11', '2012-02-10'])
laborday_mask = df_train['Date'].isin(['2010-09-10', '2011-09-09','2012-09-07'])
thanksgiving_mask = df_train['Date'].isin(['2010-11-26', '2011-11-25'])
christmas_mask = df_train['Date'].isin(['2010-12-31', '2011-12-30'])
df_train['superbowl'] = superbowl_mask
df_train['laborday'] = laborday_mask
df_train['thanksgiving'] =thanksgiving_mask
df_train['christmas'] = christmas_mask
Run the next cell to see:
df_train.groupby(['christmas'])\
.agg(count = ('christmas', 'size'),
avg_weekly_sales= ('Weekly_Sales','mean'))
| count | avg_weekly_sales | |
|---|---|---|
| christmas | ||
| False | 8067 | 22,182.57 |
| True | 115 | 20,304.58 |
Perform the same analysis for the other three holidays:
holidays = ['superbowl', 'laborday', 'thanksgiving', 'christmas']
for holiday in holidays:
    summary_stats = df_train.groupby(holiday)\
        .agg(count=(holiday, 'size'),
             avg_weekly_sales=('Weekly_Sales', 'mean'))
    print(summary_stats, "\n")
| count | avg_weekly_sales | |
|---|---|---|
| superbowl | ||
| False | 8011 | 22,112.65 |
| True | 171 | 24,195.12 |
| count | avg_weekly_sales | |
|---|---|---|
| laborday | ||
| False | 8011 | 22,147.65 |
| True | 171 | 22,555.53 |
| count | avg_weekly_sales | |
|---|---|---|
| thanksgiving | ||
| False | 8062 | 22,073.51 |
| True | 120 | 27,710.04 |
| count | avg_weekly_sales | |
|---|---|---|
| christmas | ||
| False | 8067 | 22,182.57 |
| True | 115 | 20,304.58 |
Without hypothesis testing and only by eyeballing, it seems like Super Bowl and Thanksgiving have a positive impact on the weekly sales for store 1 in the training dataset.
Discuss with your teammate, are you surprised that during Christmas, sales at Walmart did not go up? Holiday effect, if causal, happened most during Thanksgiving weeks. Is this something you expected?
We have been ignoring Dept, let's take a look at the plot below showing the weekly sales by department in 2011.
plt.figure(figsize=(10,4))
sns.scatterplot(data=df_train[df_train.year==2011], x = 'Dept', y= target, hue='IsHoliday');
Dept 72 has unusually high weekly sales during the holiday week, but we would need more data to determine whether this is a data issue, an outlier, or a special event.
sns.lineplot(data=df_train, x="Fuel_Price", y="Weekly_Sales");
sns.lineplot(data=df_train, x="Temperature", y="Weekly_Sales");
sns.lineplot(data=df_train, x="CPI", y="Weekly_Sales");
sns.lineplot(data=df_train, x="Unemployment", y="Weekly_Sales");
By eyeballing, do you find strong evidence that these external and macroeconomic factors are correlated with Walmart's weekly sales? Do you think lineplot is an appropriate plot to visualize this relationship?
Lastly, we calculate the Spearman correlations between the target and the external factors to verify that there is no strong monotonic relationship between the target variable and these features.
plt.figure(figsize=(6, 6))
df_train_reduced = df_train[[target] + external_factors]
corr = df_train_reduced.corr(method='spearman')
heatmap = sns.heatmap(corr.sort_values(by=target, ascending=False),
                      vmin=-1, vmax=1, annot=True, fmt='.1g', cmap='BrBG')
heatmap.set_title('Features Correlating with Weekly Sales', fontdict={'fontsize': 12}, pad=16);
"Feature Engineering encapsulates various data engineering techniques such as selecting relevant features, handling missing data, encoding the data, and normalizing it. It is one of the most crucial tasks and plays a major role in determining the outcome of a model." Ref.
One part of feature engineering is to create new features from the given data, like the thanksgiving column that was earlier derived from Date.
Common techniques for tabular data include adding summary statistics of the numerical features such as mean and standard deviation, and creating new features from the interaction of multiple features, etc.
In this task, we will work on handling missing data, normalizing numerical features, and encoding categorical features.
First, let's focus on missing data. Missing value treatment is crucial, yet not trivial. Read Tackling Missing Value in Dataset for a detailed explanation. Features with nulls or wrong values (e.g., a negative fuel price) need to be imputed or removed.
From the earlier steps, we observed that only the markdown columns contain missing values; however, we do not have more information on what those columns represent exactly.
df_train.columns[df_train.isna().sum() != 0]
Index(['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5'], dtype='object')
For each column, find out what percentage of the data is missing.
md_cols = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']
for col in md_cols:
    perc_missing = df_train[col].isna().mean()  # fraction of NaNs in the column
    print(f'{col}: {perc_missing:.0%} is missing')
MarkDown1: 65% is missing MarkDown2: 71% is missing MarkDown3: 65% is missing MarkDown4: 65% is missing MarkDown5: 65% is missing
The majority of the markdown fields are missing. This is where, again, we need to communicate with the stakeholders to understand what the data measure, how the data was collected and then determine our strategy from there. Since we want to understand the impacts of MarkDownx on weekly sales, we will keep the features and impute the missing values. We have learned that there are tradeoffs with how we treat missing values and that our choice of imputation can be significantly impacted by extreme values and the amount of the missing data. We choose to impute with the median here to mitigate these negative impacts. Use .fillna() to impute the missing values.
# Note: newer pandas requires numeric_only=True when the frame has
# datetime/object columns: df_train.fillna(df_train.median(numeric_only=True))
df_train = df_train.fillna(df_train.median())
(df_train.isna().sum() != 0).sum() # sanity check: 0
0
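An alternative worth knowing: scikit-learn's `SimpleImputer` can do the same median imputation, with the advantage that the medians learned on the training set can be reapplied to the test set (fit on train, transform test), avoiding leakage. A hedged sketch on toy data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"MarkDown1": [410.31, np.nan, 950.33, np.nan]})
test = pd.DataFrame({"MarkDown1": [np.nan, 1164.46]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # learns the train median: 680.32
test_filled = imputer.transform(test)        # reuses the train median on test

print(train_filled.ravel())
print(test_filled.ravel())
```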
Visualize the distributions of those markdown fields after imputation. Are they normal?
plt.figure(figsize=figsize)
for i in range(len(md_cols)):
    plt.subplot(1, len(md_cols), i + 1)
    sns.distplot(df_train[md_cols[i]],
                 hist_kws=dict(linewidth=2),
                 bins=10,
                 color=colors[i])
plt.tight_layout();
# boxplot
plt.figure(figsize=figsize)
for i in range(len(md_cols)):
    plt.subplot(1, len(md_cols), i + 1)  # 1 row of box plots, matching above
    df_train.boxplot(md_cols[i])
plt.tight_layout();
Note that missing values are different from outliers. Outliers are feature values that are rare in nature; they can unnecessarily skew the data and cause problems for modeling. Outlier treatment involves removing or imputing such values. One popular approach to identifying outliers is the IQR rule: data points that lie more than 1.5 times the IQR above Q3 (third quartile) or below Q1 (first quartile) are outliers. Read Detecting and Treating Outliers for more.
We will leave it as an optional exercise for you to identify outliers using IQR, and replace the outliers with the median.
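A possible solution sketch for that optional exercise (the helper name is my own; clipping to the IQR bounds, i.e. winsorizing, would be a reasonable alternative to substituting the median):

```python
import pandas as pd

def replace_iqr_outliers_with_median(s: pd.Series) -> pd.Series:
    """Replace values beyond 1.5 * IQR of [Q1, Q3] with the series median."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # .where keeps in-range values and substitutes the median elsewhere
    return s.where(s.between(lower, upper), s.median())

sample = pd.Series([10, 11, 12, 13, 14, 1000])  # 1000 is an obvious outlier
cleaned = replace_iqr_outliers_with_median(sample)
print(cleaned.tolist())  # [10.0, 11.0, 12.0, 13.0, 14.0, 12.5]
```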
df_train.head(2)
| Store | Dept | Date | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | ... | Unemployment | Type | Size | week | month | year | superbowl | laborday | thanksgiving | christmas | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3590 | 1 | 27 | 2010-12-17 | 3,190.20 | False | 49.84 | 2.87 | 6,154.14 | 137.86 | 30.23 | ... | 7.84 | A | 151315 | 50 | 12 | 2010 | False | False | False | False |
| 9297 | 1 | 92 | 2011-06-10 | 137,181.13 | False | 83.13 | 3.65 | 6,154.14 | 137.86 | 30.23 | ... | 7.68 | A | 151315 | 23 | 6 | 2011 | False | False | False | False |
2 rows × 23 columns
import numpy as np
def detect_outliers_iqr(data):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    outliers = []  # local list, so repeated calls don't accumulate results
    data = sorted(data)
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lwr_bound = q1 - (1.5 * iqr)
    upr_bound = q3 + (1.5 * iqr)
    for value in data:
        if value < lwr_bound or value > upr_bound:
            outliers.append(value)
    return outliers

# Driver code
sample = df_train['MarkDown1']
sample_outliers = detect_outliers_iqr(sample)
print("Outliers from IQR method: ", sample_outliers)
Outliers from IQR method:  [410.31, 410.31, 410.31, ..., 34577.06, 34577.06, 34577.06] (long output truncated)
Now let's see how we can normalize the data. For numerical features this means scaling the features to a similar range. This step is crucial for machine learning algorithms that calculate distances between data points (e.g., read The Importance of Feature Scaling).
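To see why this matters, here is a tiny sketch with made-up numbers: before scaling, the wide-range markdown-like feature dominates the Euclidean distance; after min-max scaling, each feature contributes according to its position within its own range.

```python
import numpy as np

# Two hypothetical rows: a Temperature-like feature and a MarkDown-like feature
a = np.array([50.0, 6000.0])
b = np.array([90.0, 6100.0])

# Unscaled: the 100-unit markdown gap dominates the distance,
# even though it is tiny relative to the markdown range
print(np.linalg.norm(a - b))  # ~107.7

# Min-max scale each feature to [0, 1] with (x - min) / (max - min),
# using made-up feature minima and maxima
lo = np.array([35.0, 400.0])
hi = np.array([92.0, 34000.0])
a_s = (a - lo) / (hi - lo)
b_s = (b - lo) / (hi - lo)

# Scaled: the 40-degree temperature gap now drives the distance
print(np.linalg.norm(a_s - b_s))  # ~0.70
```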
For this task, let's keep Temperature from the external features, since it is the most linearly correlated with the target variable, though the correlation is weak and negative.
In addition, we include one markdown field. Since neither feature seems to follow a normal distribution, it is safer to use MinMaxScaler from sklearn.preprocessing to transform features by scaling each feature to a given range (see the discussion on Normalization vs Standardization).
from sklearn.preprocessing import MinMaxScaler
numericalFeatures = ['Temperature', 'MarkDown1']
df_train_num = df_train[numericalFeatures]
df_train_num.describe() # Check the summary statistics
| Temperature | MarkDown1 | |
|---|---|---|
| count | 8,182.00 | 8,182.00 |
| mean | 68.20 | 6,820.37 |
| std | 14.26 | 3,951.17 |
| min | 35.40 | 410.31 |
| 25% | 57.79 | 6,154.14 |
| 50% | 69.31 | 6,154.14 |
| 75% | 80.49 | 6,154.14 |
| max | 91.65 | 34,577.06 |
Instantiate a MinMaxScaler and fit using df_train_num:
scaler = MinMaxScaler()  # creates an instance of the MinMaxScaler class
scaler.fit(df_train_num).get_feature_names_out()
array(['Temperature', 'MarkDown1'], dtype=object)
Now transform training data df_train_num and store the resulting nparray in train_norm:
train_norm = scaler.transform(df_train_num)
Verify that both columns now have minimum 0 and maximum 1.
df_norm = pd.DataFrame(train_norm, columns = ['Temperature','MarkDown1'])
df_norm.head()
| Temperature | MarkDown1 | |
|---|---|---|
| 0 | 0.26 | 0.17 |
| 1 | 0.85 | 0.17 |
| 2 | 0.80 | 0.17 |
| 3 | 0.75 | 0.17 |
| 4 | 0.20 | 0.17 |
df_train_num.head()
| Temperature | MarkDown1 | |
|---|---|---|
| 3590 | 49.84 | 6,154.14 |
| 9297 | 83.13 | 6,154.14 |
| 22 | 80.48 | 6,154.14 |
| 9989 | 77.72 | 6,154.14 |
| 576 | 46.50 | 6,154.14 |
pd.DataFrame(train_norm, columns=df_train_num.columns).describe()
| Temperature | MarkDown1 | |
|---|---|---|
| count | 8,182.00 | 8,182.00 |
| mean | 0.58 | 0.19 |
| std | 0.25 | 0.12 |
| min | 0.00 | 0.00 |
| 25% | 0.40 | 0.17 |
| 50% | 0.60 | 0.17 |
| 75% | 0.80 | 0.17 |
| max | 1.00 | 1.00 |
Let's turn to categorical features. Most Python modeling packages do not accept strings as input, so encoding categorical values as numerical values is a necessary step. Here, let's apply one-hot encoding on Dept and IsHoliday:
from sklearn.preprocessing import OneHotEncoder
categoricalFeatures = ['Dept', 'IsHoliday']
df_train_cat = df_train[categoricalFeatures]
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(df_train_cat)  # in scikit-learn >= 1.2, use sparse_output=False
Transform the categorical features using the fitted one-hot encoder ohe.
train_ohe = ohe.transform(df_train_cat)# YOUR CODE HERE
train_ohe.shape, df_train_cat.shape # Expected Output: ((8182, 79), (8182, 2))
((8182, 79), (8182, 2))
The number of columns explodes from 2 to 79.
Lastly we merge the processed numerical features with the processed categorical features using hstack in numpy:
import numpy as np
X_train = np.hstack([train_norm, train_ohe])
X_train.shape # sanity check: (8182, 81)
(8182, 81)
dfxtrain = pd.DataFrame(X_train)
dfxtrain.head(2)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.26 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 |
| 1 | 0.85 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 |
2 rows × 81 columns
What about the test data?
Yes, you need to apply the same data processing. But be careful about data leakage!
We'll spare ourselves some copy + paste + edit and see how this can be done when we introduce Pipeline next.
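The key rule, sketched below with toy numbers (the arrays are made up for illustration): fit the scaler on the training data only, then only transform the test data. Fitting on the test set would leak its statistics into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_tr = np.array([[1.0], [3.0], [5.0]])  # toy training feature
X_te = np.array([[4.0], [6.0]])         # toy test feature

scaler = MinMaxScaler().fit(X_tr)  # fit on training data ONLY
print(scaler.transform(X_te))      # transform (never fit) the test data
# [[0.75]
#  [1.25]]
```

Note that transformed test values can fall outside [0, 1] when they exceed the training range; that is expected and correct.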
Even with fewer than 20 features in our dataset, there are many choices to make when preprocessing the data.
There is no one-size-fits-all approach; often you will find yourself experimenting with many combinations to achieve better modeling performance: Should I apply normalization or standardization? Do I remove the outliers, or should I impute them? Do I impute the missing values with the median, the mean, or 0? The answer to many of these questions is "It depends." (Have you heard of Graduate Student Descent?) That means trial and error, and it is not efficient to produce a new notebook each time you need to try something slightly different; you will get lost quickly. Pipeline is one useful tool here.
Not only does Pipeline streamline the process and keep the code modular, it also reduces the possibility of introducing errors and bugs.
In this task, we build the pipeline following the strategies used in the last task, run a simple linear regression model, and print out the model's performance. Note that there is minimal code for you to implement; the key is to understand each step.
To avoid confusion, let's read the data again directly from train-store1.csv.
df = pd.read_csv('../dat/train-store1.csv')
df.shape
(10244, 16)
Separating the target y from the features X:
target = 'Weekly_Sales'
print(target)
Weekly_Sales
X, y = df.drop(columns=target), df[target]
Import Pipeline from submodule sklearn.pipeline
from sklearn.pipeline import Pipeline
Now we build a data transformation for numerical features following two steps: impute the missing values with the feature median (use SimpleImputer), followed by normalization (use MinMaxScaler).
Note: do not confuse the "transformers" that are part of data preprocessing with the deep learning models of the same name.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
numeric_features = ['CPI', 'MarkDown1']
numeric_transformer = Pipeline(steps=[
("imputer", SimpleImputer(strategy="median")),
("min_max_scaler",MinMaxScaler())# YOUR CODE HERE
])
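To see what `numeric_transformer` does step by step, here is a quick check on a single toy column (invented values, not from the dataset): the NaN is replaced by the column median, and the result is rescaled to [0, 1].

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("min_max_scaler", MinMaxScaler()),
])

# NaN -> median of [1, 3] = 2.0, then [1, 2, 3] -> [0, 0.5, 1]
col = np.array([[1.0], [np.nan], [3.0]])
print(numeric_transformer.fit_transform(col).ravel())  # [0.  0.5 1. ]
```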
For categorical features, we apply one-hot encoding with OneHotEncoder (there are many other options; see the scikit-learn documentation):
categorical_features = ['Dept', 'IsHoliday']
from sklearn.preprocessing import OneHotEncoder

categorical_transformer = OneHotEncoder(handle_unknown='ignore')
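What does `handle_unknown='ignore'` buy us? A quick standalone demo (toy categories, not from the dataset): each known category becomes its own 0/1 column, and a category never seen during fit is encoded as all zeros instead of raising an error at predict time.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["A"], ["B"], ["A"]]))  # learned categories: A, B

# "C" was never seen during fit -> encoded as all zeros, no error
print(enc.transform(np.array([["B"], ["C"]])).toarray())
# [[0. 1.]
#  [0. 0.]]
```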
Piece together the numeric_transformer and categorical_transformer using ColumnTransformer:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numeric_features),
("cat", categorical_transformer, categorical_features),
]
)
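To see the routing in action, here is a standalone toy sketch (three made-up rows using two of our column names): ColumnTransformer sends each listed column to its own transformer and concatenates the results side by side; columns not listed are dropped by default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

toy = pd.DataFrame({
    "CPI": [210.0, np.nan, 212.0],      # numeric, with a missing value
    "IsHoliday": [False, True, False],  # categorical
})
pre = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("min_max_scaler", MinMaxScaler())]), ["CPI"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["IsHoliday"]),
])
out = pre.fit_transform(toy)
print(out.shape)  # (3, 3): 1 scaled numeric column + 2 one-hot columns
```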
Lastly, let's append the regression model to the preprocessing pipeline to complete a full prediction pipeline.
from sklearn.linear_model import LinearRegression
model = Pipeline(
steps=[("preprocessor", preprocessor), ("model", LinearRegression())]
)
The pipeline has been built! The rest is to split the data into training and test sets, fit the pipeline, and score it.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Let's run the prediction!
model.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median')),
('min_max_scaler',
MinMaxScaler())]),
['CPI', 'MarkDown1']),
('cat',
OneHotEncoder(handle_unknown='ignore'),
['Dept', 'IsHoliday'])])),
('model', LinearRegression())])
print("model score: %.3f" % model.score(X_test, y_test))
model score: 0.949
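A quick aside on what that number means: for regressors, `.score` returns the coefficient of determination R². A self-contained check on toy data (not the Walmart data) that `score` and `r2_score` agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Nearly-linear toy data, so R^2 should be high but not exactly 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.5])
reg = LinearRegression().fit(X, y)

# score() for a regressor is exactly r2_score(y_true, y_pred)
assert np.isclose(reg.score(X, y), r2_score(y, reg.predict(X)))
print(round(reg.score(X, y), 3))
```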
Optional: Discuss what type of Feature Selection strategy you would use to select the features.
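One possible answer, sketched standalone on synthetic data (this is not the only strategy): a filter method such as SelectKBest scores each feature against the target and keeps the top k; it can be slotted into a Pipeline as an extra step. Here only the first of five random features actually drives y, and the selector finds it.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only column 0 matters

pipe = Pipeline(steps=[
    ("select", SelectKBest(score_func=f_regression, k=1)),
    ("model", LinearRegression()),
])
pipe.fit(X, y)
print(pipe.named_steps["select"].get_support())  # only column 0 is kept
```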
In this exercise, you have learned the manual way to perform EDA. Doing EDA manually has the benefits of customization, but is also highly repetitive. For this reason, a lot of EDA can easily be automated! In automating our EDA, we can get to know our data more quickly and spend more time on feature engineering and modeling. Let's check out a library called SweetViz to see how we can automate EDA!
import sweetviz as sv
orig_data_report = sv.analyze(df)
orig_data_report.show_notebook()
- SweetViz calculates the descriptive stats for each feature, along with its missing and duplicate value stats.
- SweetViz helps to detect numerical vs. categorical datatypes.
- Use the ASSOCIATIONS tab to explore associations/correlations!
orig_data_report.show_html('walmart_data_report.html', open_browser=True)
Report walmart_data_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
compare_report = sv.compare([X_train, 'Train'], [X_test, 'Test'])
compare_report.show_notebook()